The competition “United Nations Millennium Development Goals”, hosted by DrivenData, is the basis for our project. More information on the competition can be found at the following link: https://www.drivendata.org/competitions/1/united-nations-millennium-development-goals/page/3/
Because entering the competition and using all of the data would exceed the scope of the lab project, the project team agreed on the following simplifications:
Our project page and repository are hosted on the FH Kufstein GitLab: Drivendata Developmentgoals
The United Nations measures progress on their defined development goals using indicators, such as percent of the population making over one dollar per day. The competition task is to predict the change in these indicators one year and five years into the future.
This helps to understand how progress towards these goals can be improved, by uncovering complex relations between the goals and other economic indicators. Given the data from 1972 to 2007, a specific indicator for each of the goals is to be predicted for 2008 and 2012.
In the year 2000, the member states of the UN defined a set of goals to measure global development. The aim is to raise the standard of living around the world by emphasizing human capital, infrastructure and human rights.
The eight goals are:
The dataset “TrainingSet.csv” was downloaded from the DrivenData portal and is provided by the World Bank. The World Bank has gathered data since its founding in 1944 and makes it available to the public.
For the competition, World Bank data from 1972 to 2007 was aggregated. It contains over 1200 macroeconomic indicators for 214 countries around the world. Each row represents a time series for a specific indicator and country. A row has an ID, a country name, a series code, a series name, and one column per year holding the value for that year (if available). Missing values appear as NA in R.
data <- read.csv("data/TrainingSet.csv")
# Understand training data variables
str(data)
## 'data.frame': 195402 obs. of 40 variables:
## $ X : int 0 1 2 4 5 6 8 9 10 11 ...
## $ X1972..YR1972.: num NA NA NA NA NA NA NA NA NA NA ...
## $ X1973..YR1973.: num NA NA NA NA NA NA NA NA NA NA ...
## $ X1974..YR1974.: num NA NA NA NA NA NA NA NA NA NA ...
## $ X1975..YR1975.: num NA NA NA NA NA NA NA NA NA NA ...
## $ X1976..YR1976.: num NA NA NA NA NA NA NA NA NA NA ...
## $ X1977..YR1977.: num NA NA NA NA NA NA NA NA NA NA ...
## $ X1978..YR1978.: num NA NA NA NA NA NA NA NA NA NA ...
## $ X1979..YR1979.: num NA NA NA NA NA NA NA NA NA NA ...
## $ X1980..YR1980.: num NA NA NA NA NA NA NA NA NA NA ...
## $ X1981..YR1981.: num NA NA NA NA NA NA NA NA NA NA ...
## $ X1982..YR1982.: num NA NA NA NA NA NA NA NA NA NA ...
## $ X1983..YR1983.: num NA NA NA NA NA NA NA NA NA NA ...
## $ X1984..YR1984.: num NA NA NA NA NA NA NA NA NA NA ...
## $ X1985..YR1985.: num NA NA NA NA NA NA NA NA NA NA ...
## $ X1986..YR1986.: num NA NA NA NA NA NA NA NA NA NA ...
## $ X1987..YR1987.: num NA NA NA NA NA NA NA NA NA NA ...
## $ X1988..YR1988.: num NA NA NA NA NA NA NA NA NA NA ...
## $ X1989..YR1989.: num NA NA NA NA NA NA NA NA NA NA ...
## $ X1990..YR1990.: num NA NA NA NA NA NA NA NA NA NA ...
## $ X1991..YR1991.: num NA NA NA NA NA NA NA NA NA NA ...
## $ X1992..YR1992.: num NA NA NA NA NA NA NA NA NA NA ...
## $ X1993..YR1993.: num NA NA NA NA NA NA NA NA NA NA ...
## $ X1994..YR1994.: num NA NA NA NA NA NA NA NA NA NA ...
## $ X1995..YR1995.: num NA NA NA NA NA NA NA NA NA NA ...
## $ X1996..YR1996.: num NA NA NA NA NA NA NA NA NA NA ...
## $ X1997..YR1997.: num NA NA NA NA NA NA NA NA NA NA ...
## $ X1998..YR1998.: num NA NA NA NA NA NA NA NA NA NA ...
## $ X1999..YR1999.: num NA NA NA NA NA NA NA NA NA NA ...
## $ X2000..YR2000.: num NA NA NA NA NA NA NA NA NA NA ...
## $ X2001..YR2001.: num NA NA NA NA NA NA NA NA NA NA ...
## $ X2002..YR2002.: num NA NA NA NA NA NA NA NA NA NA ...
## $ X2003..YR2003.: num NA NA NA NA NA NA NA NA NA NA ...
## $ X2004..YR2004.: num NA NA NA NA NA NA NA NA NA NA ...
## $ X2005..YR2005.: num NA NA NA NA NA NA NA NA NA NA ...
## $ X2006..YR2006.: num NA NA NA NA NA NA NA NA NA NA ...
## $ X2007..YR2007.: num 3.77 7.03 8.24 12.93 19 ...
## $ Country.Name : Factor w/ 214 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Series.Code : Factor w/ 1305 levels "1.2","2.1","3.2",..: 36 39 33 38 41 35 37 40 34 644 ...
## $ Series.Name : Factor w/ 1305 levels "(%) Benefits held by 1st 20% population - All Social Insurance",..: 1 2 3 5 6 7 9 10 11 12 ...
The dataset contains 195402 observations of 40 variables. At first glance, the data contains many NA entries, i.e. missing values. We ignore these for now and first look at the variables in the dataset:
- X: an integer ID that identifies the time series for a specific indicator and country.
- X1972..YR1972 to X2007..YR2007: numeric columns holding the yearly values of the time series from 1972 to 2007, one column per year.
- Country.Name: a factor variable with 214 levels, one per country.
- Series.Code and Series.Name: factor variables that identify the macroeconomic indicator by code and by name.

The variable names are cumbersome to read, so we rename them to make them easier to use later in the analysis.
# Rename training data variables
colnames(data) <- (sub('[X][0-9]{4}[.]{2}','', colnames(data)))
colnames(data) <- (sub('\\.$','', colnames(data)))
head(data,2)
## X YR1972 YR1973 YR1974 YR1975 YR1976 YR1977 YR1978 YR1979 YR1980 YR1981
## 1 0 NA NA NA NA NA NA NA NA NA NA
## 2 1 NA NA NA NA NA NA NA NA NA NA
## YR1982 YR1983 YR1984 YR1985 YR1986 YR1987 YR1988 YR1989 YR1990 YR1991 YR1992
## 1 NA NA NA NA NA NA NA NA NA NA NA
## 2 NA NA NA NA NA NA NA NA NA NA NA
## YR1993 YR1994 YR1995 YR1996 YR1997 YR1998 YR1999 YR2000 YR2001 YR2002 YR2003
## 1 NA NA NA NA NA NA NA NA NA NA NA
## 2 NA NA NA NA NA NA NA NA NA NA NA
## YR2004 YR2005 YR2006 YR2007 Country.Name Series.Code
## 1 NA NA NA 3.769214 Afghanistan allsi.bi_q1
## 2 NA NA NA 7.027746 Afghanistan allsp.bi_q1
## Series.Name
## 1 (%) Benefits held by 1st 20% population - All Social Insurance
## 2 (%) Benefits held by 1st 20% population - All Social Protection
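The renaming can also be expressed as a single substitution with a capture group. The following is a minimal sketch, not the code used in the project; the vector `raw_names` is an illustrative sample of the column names shown in the `str()` output.

```r
# Sample of the column names as produced by read.csv() for this file
raw_names <- c("X", "X1972..YR1972.", "X2007..YR2007.", "Country.Name")

# Capture the "YRnnnn" part and drop the rest; names that do not match
# the pattern (such as "X" and "Country.Name") are left untouched
clean_names <- sub("^X[0-9]{4}\\.\\.(YR[0-9]{4})\\.$", "\\1", raw_names)
print(clean_names)
```

Anchoring the pattern with `^` and `$` makes it explicit that only the year columns are rewritten.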
Before deciding how to treat the missing values, we need to understand the dataset in more depth. We start by counting the missing values per column.
data.frame("#_of_missing_values"=colSums(is.na(data)))
## X._of_missing_values
## X 0
## YR1972 130457
## YR1973 130959
## YR1974 130436
## YR1975 128429
## YR1976 127685
## YR1977 125667
## YR1978 125639
## YR1979 125496
## YR1980 120152
## YR1981 117368
## YR1982 116386
## YR1983 116420
## YR1984 115870
## YR1985 114385
## YR1986 113947
## YR1987 112650
## YR1988 112160
## YR1989 109071
## YR1990 88447
## YR1991 88411
## YR1992 83159
## YR1993 80849
## YR1994 78579
## YR1995 70934
## YR1996 71028
## YR1997 69716
## YR1998 69458
## YR1999 64522
## YR2000 54855
## YR2001 58619
## YR2002 55087
## YR2003 56243
## YR2004 53023
## YR2005 33858
## YR2006 36514
## YR2007 33806
## Country.Name 0
## Series.Code 0
## Series.Name 0
The training set has a total of 195402 observations, so a large share of the values is missing in every year: about two thirds in 1972 (130457 of 195402), more than half in every year up to 1989, and still about 17% in 2007. This confirms our decision to drastically simplify the data for our project before working with it.
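The share of missing values per year is easier to judge as a percentage than as a raw count. A minimal sketch on a small made-up data frame (`toy` and its values are assumptions for illustration, not taken from the dataset):

```r
# Toy data frame with the same layout as the renamed training data
toy <- data.frame(
  X = 0:3,
  YR1972 = c(NA, NA, NA, 1.5),
  YR2007 = c(3.8, NA, 8.2, 12.9)
)

# Select the year columns and compute the percentage of NA values per column;
# the mean of a logical vector is the fraction of TRUE entries
year_cols <- grepl("^YR", colnames(toy))
pct_missing <- colMeans(is.na(toy[year_cols])) * 100
print(round(pct_missing, 1))
```

Applied to the real data frame, the same two lines would give the percentage of missing values for each year from 1972 to 2007.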
As we decided to focus on national development in African countries for this project, we first define which countries are in Africa and then filter the dataset for them.
# list of African countries to choose from
african_countries <- c(
'Nigeria', 'Ethiopia', 'Egypt', 'Democratic Republic of the Congo',
'South Africa', 'Tanzania', 'Kenya', 'Algeria', 'Uganda',
'Sudan', 'Morocco', 'Ghana', 'Mozambique', 'Ivory Coast',
'Madagascar', 'Angola', 'Cameroon', 'Niger', 'Burkina Faso',
'Mali', 'Malawi', 'Zambia', 'Senegal', 'Zimbabwe', 'Chad',
'Guinea', 'Tunisia', 'Rwanda', 'South Sudan', 'Benin',
'Somalia', 'Burundi', 'Togo', 'Libya', 'Sierra Leone',
'Central African Republic', 'Eritrea', 'Republic of the Congo',
'Liberia', 'Mauritania', 'Gabon', 'Namibia', 'Botswana',
'Lesotho', 'Equatorial Guinea', 'Gambia', 'Guinea-Bissau',
'Mauritius', 'Swaziland', 'Djibouti', 'Reunion (France)',
'Comoros', 'Western Sahara', 'Cape Verde', 'Seychelles'
)
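A hand-written country list only works if its spelling matches the names used in the data exactly. The sketch below shows one way to check this with `setdiff()`. The vector `names_in_data` is a hypothetical excerpt; the actual spellings in the dataset may differ and should be checked against `levels(data$Country.Name)`.

```r
# Hypothetical excerpt of country names as they might appear in the data
names_in_data <- c("Egypt, Arab Rep.", "Ghana", "Cote d'Ivoire", "Botswana")

# Names from a hand-written list
wanted <- c("Egypt", "Ghana", "Ivory Coast", "Botswana")

# Entries of the list that have no exact match in the data
unmatched <- setdiff(wanted, names_in_data)
print(unmatched)
```

Any name reported here would be silently dropped by a filter based on `%in%`, so it is worth running this check before filtering.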
Let us see what the indicators in the global development are:
library(dplyr) # provides %>%, distinct() and glimpse()
indicators <- data %>% distinct(Series.Name)
glimpse(indicators)
## Rows: 1,305
## Columns: 1
## $ Series.Name <fct> "(%) Benefits held by 1st 20% population - All Social Ins…
There are 1305 indicators for global development. After reading through all of them manually, we chose the ones listed below as ind_subset for the analysis. The intention was to include not only indicators with an obvious influence on CO2 emissions, but also less obvious ones, to see whether they correlate with CO2 emissions.
#africa = world %>%
# filter(continent == "Africa") %>%
# dplyr::select(name_long, subregion) %>%
# st_transform("+proj=aea +lat_1=20 +lat_2=-23 +lat_0=0 +lon_0=25")
africa = world %>%
filter(continent == "Africa", !is.na(iso_a2)) %>%
left_join(worldbank_df, by = "iso_a2") %>%
dplyr::select(name, name_long, subregion, gdpPercap, HDI, pop_growth) %>%
st_transform("+proj=aea +lat_1=20 +lat_2=-23 +lat_0=0 +lon_0=25")
tm_shape(africa) +
tm_fill("darkgreen") +
tm_borders() +
tm_text("name_long", size = 0.3) +
tm_layout(frame = FALSE, title = "Location of Countries in Africa", title.size = 1, title.position = c(x = 0.42, y = 0.98))
plot(africa["gdpPercap"])
tm_shape(africa) + tm_polygons("subregion")
head(indicators,10)
## Series.Name
## 1 (%) Benefits held by 1st 20% population - All Social Insurance
## 2 (%) Benefits held by 1st 20% population - All Social Protection
## 3 (%) Benefits held by 1st 20% population - All Social Safety Nets
## 4 (%) Generosity of All Social Insurance
## 5 (%) Generosity of All Social Protection
## 6 (%) Generosity of All Social Safety Nets
## 7 (%) Program participation - All Social Insurance
## 8 (%) Program participation - All Social Protection
## 9 (%) Program participation - All Social Safety Nets
## 10 (%) Program participation - Unemp benefits and ALMP
# Predictor list to be included in the analysis
ind_subset <- c('Agricultural land (% of land area)',
'Forest area (% of land area)',
'GDP per capita (current US$)',
'Organic water pollutant (BOD) emissions (kg per day)',
'Population (Total)',
'Tax revenue (% of GDP)',
'Agricultural methane emissions (% of total)',
'Agricultural nitrous oxide emissions (% of total)',
'Electric power consumption (kWh per capita)',
'Electricity production (kWh)',
'Adjusted net national income per capita (current US$)',
'Adjusted savings: net forest depletion (current US$)',
'Fuel exports (% of merchandise exports)',
'Fuel imports (% of merchandise imports)',
'Household final consumption expenditure, etc. (% of GDP)',
'Industry, value added (% of GDP)',
'Rural population (% of total population)',
'Terrestrial protected areas (% of total land area)',
'Alternative and nuclear energy (% of total energy use)',
'Public spending on education, total (% of GDP)',
'CO2 emissions (kt)'
)
Now the data frame is filtered for African countries and the chosen indicators.
# keep only rows whose indicator is in ind_subset and whose country is in african_countries
african_data <- data %>% filter(Series.Name %in% ind_subset) %>% filter(Country.Name %in% african_countries) %>% droplevels()
knitr::kable(head(african_data))
| X | YR1972 | YR1973 | YR1974 | YR1975 | YR1976 | YR1977 | YR1978 | YR1979 | YR1980 | YR1981 | YR1982 | YR1983 | YR1984 | YR1985 | YR1986 | YR1987 | YR1988 | YR1989 | YR1990 | YR1991 | YR1992 | YR1993 | YR1994 | YR1995 | YR1996 | YR1997 | YR1998 | YR1999 | YR2000 | YR2001 | YR2002 | YR2003 | YR2004 | YR2005 | YR2006 | YR2007 | Country.Name | Series.Code | Series.Name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2698 | 3.703103e+02 | 4.490870e+02 | 5.977440e+02 | 7.048002e+02 | 7.601826e+02 | 8.708956e+02 | 1.086247e+03 | 1.074418e+03 | 1.453877e+03 | 1.604722e+03 | 1.626582e+03 | 1.735252e+03 | 1.876532e+03 | 1.964064e+03 | 2.196207e+03 | 2.238712e+03 | 1.927841e+03 | 1.778415e+03 | 1.905286e+03 | 1.322028e+03 | 1.397706e+03 | 1.393235e+03 | 1.150915e+03 | 1.083145e+03 | 1.150656e+03 | 1.197771e+03 | 1.225551e+03 | 1.175283e+03 | 1.162427e+03 | 1.219767e+03 | 1.266269e+03 | 1.392367e+03 | 1.709043e+03 | 1.814468e+03 | 2.075911e+03 | 2.532589e+03 | Algeria | NY.ADJ.NNTY.PC.CD | Adjusted net national income per capita (current US$) |
| 2716 | 1.523023e+07 | 2.609364e+07 | 4.856093e+07 | 4.720407e+07 | 5.198167e+07 | 5.343213e+07 | 5.413163e+07 | 6.155376e+07 | 1.007059e+08 | 1.110906e+08 | 8.804325e+07 | 6.025715e+07 | 6.843253e+07 | 6.505040e+07 | 1.004304e+08 | 1.017604e+08 | 1.635064e+08 | 1.391914e+08 | 1.283220e+08 | 1.182529e+08 | 1.644744e+08 | 1.338313e+08 | 1.484143e+08 | 2.451115e+08 | 2.581929e+08 | 1.560767e+08 | 1.250683e+08 | 8.763836e+07 | 7.610964e+07 | 1.149368e+08 | 1.471750e+08 | 2.122871e+08 | 2.011394e+08 | 1.980771e+08 | 1.969198e+08 | 1.812542e+08 | Algeria | NY.ADJ.DFOR.CD | Adjusted savings: net forest depletion (current US$) |
| 2726 | 1.906001e+01 | 1.860153e+01 | 1.861496e+01 | 1.837018e+01 | 1.848271e+01 | 1.840335e+01 | 1.840797e+01 | 1.839831e+01 | 1.840251e+01 | 1.644638e+01 | 1.641951e+01 | 1.649298e+01 | 1.663070e+01 | 1.639600e+01 | 1.624359e+01 | 1.628179e+01 | 1.629775e+01 | 1.627382e+01 | 1.623855e+01 | 1.621588e+01 | 1.631790e+01 | 1.631664e+01 | 1.664329e+01 | 1.664707e+01 | 1.664162e+01 | 1.666429e+01 | 1.672139e+01 | 1.668150e+01 | 1.680326e+01 | 1.684021e+01 | 1.673356e+01 | 1.675485e+01 | 1.727519e+01 | 1.730290e+01 | 1.729030e+01 | 1.732011e+01 | Algeria | AG.LND.AGRI.ZS | Agricultural land (% of land area) |
| 2730 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1.191474e+01 | NA | NA | NA | NA | NA | NA | NA | NA | NA | 9.627003e+00 | NA | NA | NA | NA | 9.760809e+00 | NA | NA | Algeria | EN.ATM.METH.AG.ZS | Agricultural methane emissions (% of total) |
| 2732 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 6.385097e+01 | NA | NA | NA | NA | NA | NA | NA | NA | NA | 5.969249e+01 | NA | NA | NA | NA | 5.833232e+01 | NA | NA | Algeria | EN.ATM.NOXE.AG.ZS | Agricultural nitrous oxide emissions (% of total) |
| 2746 | 1.058804e+00 | 1.341085e+00 | 8.241101e-01 | 5.109555e-01 | 5.309654e-01 | 3.280348e-01 | 2.476033e-01 | 2.275048e-01 | 1.972465e-01 | 2.583753e-01 | 2.595926e-01 | 1.178486e-01 | 2.286989e-01 | 3.130946e-01 | 1.072661e-01 | 2.168106e-01 | 7.531740e-02 | 9.380700e-02 | 5.232690e-02 | 1.073660e-01 | 7.096180e-02 | 1.259176e-01 | 6.143160e-02 | 6.890230e-02 | 4.945210e-02 | 2.681040e-02 | 7.429380e-02 | 6.569130e-02 | 1.720140e-02 | 2.191900e-02 | 1.703680e-02 | 7.424050e-02 | 6.969100e-02 | 1.476125e-01 | 5.409480e-02 | 5.282570e-02 | Algeria | EG.USE.COMM.CL.ZS | Alternative and nuclear energy (% of total energy use) |
Since many values are missing, we want to find out which African countries have data for all of the indicators we defined, and work only with those countries.
# count how many indicator rows each country has in the filtered data frame
african_data_rows <- african_data %>% group_by(Country.Name) %>% summarise(n = n())
## `summarise()` ungrouping output (override with `.groups` argument)
# filter for countries which have data for all the indicators
countries <- african_data_rows[order(african_data_rows$n, decreasing = TRUE), ] %>% filter(n == length(ind_subset))
countries
## # A tibble: 6 x 2
## Country.Name n
## <fct> <int>
## 1 Botswana 21
## 2 Ethiopia 21
## 3 Ghana 21
## 4 Morocco 21
## 5 South Africa 21
## 6 Zimbabwe 21
Six countries have data for all the indicators we defined to predict total CO2 emissions. This is a reasonable number of countries to work with. We filter the data again for these countries:
# keep only rows whose country is in the list of African countries with all defined predictors
african_data <- african_data %>% filter(Country.Name %in% countries$Country.Name) %>% droplevels()
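Counting rows per country works here because each indicator appears at most once per country. A slightly stricter variant checks the indicator names directly. This is a minimal sketch in base R on made-up data (`toy` and `wanted_ind` are illustrative assumptions):

```r
# Toy data: countries A and C have all three indicators, B lacks "Pop"
toy <- data.frame(
  Country.Name = c("A", "A", "A", "B", "B", "C", "C", "C"),
  Series.Name  = c("CO2", "GDP", "Pop", "CO2", "GDP", "CO2", "GDP", "Pop")
)
wanted_ind <- c("CO2", "GDP", "Pop")

# For each country, test whether every wanted indicator is present
has_all <- tapply(toy$Series.Name, toy$Country.Name,
                  function(s) all(wanted_ind %in% s))
complete <- names(has_all)[has_all]
print(complete)
```

Note that this only checks that a row exists for each indicator; a row can still consist mostly of NA values.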
We want to reshape the data frame so that it is easier to work with, with one column per indicator.
The first idea is to ignore the time dependence of the data and use the mean over all years for each indicator and country.
code <- african_data %>% distinct(Series.Code)
col_filter <- grepl('YR' , colnames(african_data))
#str(african_data[col_filter])
#create new column with average over years for each grouped row of Series.Name
african_data$YearsMean <- rowMeans(african_data[col_filter], na.rm=TRUE)
#Select only needed columns
african_data_cleaned <- african_data %>% dplyr::select(Series.Name, Country.Name, Series.Code, YearsMean)
glimpse(african_data_cleaned)
## Rows: 126
## Columns: 4
## $ Series.Name <fct> "Adjusted net national income per capita (current US$)",…
## $ Country.Name <fct> Botswana, Botswana, Botswana, Botswana, Botswana, Botswa…
## $ Series.Code <fct> NY.ADJ.NNTY.PC.CD, NY.ADJ.DFOR.CD, AG.LND.AGRI.ZS, EN.AT…
## $ YearsMean <dbl> 1.812273e+03, 0.000000e+00, 4.577569e+01, 8.493407e+01, …
head(african_data_cleaned)
## Series.Name Country.Name
## 1 Adjusted net national income per capita (current US$) Botswana
## 2 Adjusted savings: net forest depletion (current US$) Botswana
## 3 Agricultural land (% of land area) Botswana
## 4 Agricultural methane emissions (% of total) Botswana
## 5 Agricultural nitrous oxide emissions (% of total) Botswana
## 6 Alternative and nuclear energy (% of total energy use) Botswana
## Series.Code YearsMean
## 1 NY.ADJ.NNTY.PC.CD 1.812273e+03
## 2 NY.ADJ.DFOR.CD 0.000000e+00
## 3 AG.LND.AGRI.ZS 4.577569e+01
## 4 EN.ATM.METH.AG.ZS 8.493407e+01
## 5 EN.ATM.NOXE.AG.ZS 9.032793e+01
## 6 EG.USE.COMM.CL.ZS 3.027329e-02
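A mean computed with `na.rm = TRUE` hides how many years actually contributed to it. The sketch below, on made-up values, adds a count of the years used next to the mean; the column name `YearsUsed` is a hypothetical addition and not part of the project code.

```r
# Toy data: one complete series, one with a single value, one entirely empty
toy <- data.frame(
  Series.Name = c("CO2", "Methane", "Empty"),
  YR1990 = c(10, 12, NA),
  YR2000 = c(20, NA, NA),
  YR2005 = c(30, NA, NA)
)

# Extract the year columns once, before adding new columns to the data frame
years <- toy[grepl("^YR", colnames(toy))]

toy$YearsMean <- rowMeans(years, na.rm = TRUE) # NaN if a row has no values at all
toy$YearsUsed <- rowSums(!is.na(years))        # number of years behind each mean
print(toy[c("Series.Name", "YearsMean", "YearsUsed")])
```

A mean that rests on only one or two years can then be treated with more caution than one based on the full period.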
# Restructuring data
# transposed row values of Series.Name to columns
african_data_country_pred <- cast(african_data_cleaned, Country.Name~Series.Name)
## Using YearsMean as value column. Use the value argument to cast to override this choice
knitr::kable(head(african_data_country_pred))
| Country.Name | Adjusted net national income per capita (current US$) | Adjusted savings: net forest depletion (current US$) | Agricultural land (% of land area) | Agricultural methane emissions (% of total) | Agricultural nitrous oxide emissions (% of total) | Alternative and nuclear energy (% of total energy use) | CO2 emissions (kt) | Electric power consumption (kWh per capita) | Electricity production (kWh) | Forest area (% of land area) | Fuel exports (% of merchandise exports) | Fuel imports (% of merchandise imports) | GDP per capita (current US$) | Household final consumption expenditure, etc. (% of GDP) | Industry, value added (% of GDP) | Organic water pollutant (BOD) emissions (kg per day) | Population (Total) | Public spending on education, total (% of GDP) | Rural population (% of total population) | Tax revenue (% of GDP) | Terrestrial protected areas (% of total land area) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Botswana | 1812.2730 | 0 | 45.77569 | 84.93407 | 90.32793 | 0.0302733 | 2324.267 | 919.07932 | 831962963 | 22.430954 | 0.0876977 | 9.947701 | 2235.4833 | 42.29913 | 50.30786 | 3266.934 | 1354196 | 5.886318 | 64.11338 | 23.206016 | 31.209527 |
| Ethiopia | 140.5132 | 1052354145 | 43.88855 | 75.30116 | 89.27969 | 0.4673520 | 3021.710 | 22.46761 | 1329472222 | 13.931971 | 0.9485470 | 15.679137 | 179.6709 | 77.28290 | 11.13892 | 21466.258 | 50293718 | 3.196325 | 87.49728 | 8.791799 | 17.764318 |
| Ghana | 339.3527 | 352566646 | 57.22913 | 41.98221 | 72.22875 | 7.8736548 | 4632.745 | 323.30300 | 5354777778 | 27.811423 | 4.5957970 | 18.007306 | 398.0186 | 82.12050 | 20.75192 | 16048.370 | 14837131 | 4.389222 | 62.70521 | 15.585179 | 14.690243 |
| Morocco | 904.2732 | 6463592 | 65.99445 | 55.34747 | 81.54220 | 1.5809552 | 25108.356 | 372.30558 | 9749805556 | 11.306794 | 2.4982789 | 16.853203 | 1049.8554 | 64.91894 | 31.25429 | 80063.917 | 24085313 | 5.739319 | 52.90811 | 21.492466 | 1.366714 |
| South Africa | 2427.9322 | 127897114 | 79.06925 | 32.72982 | 59.55242 | 1.8325246 | 312464.764 | 4082.29959 | 157117222222 | 7.617737 | 7.2834200 | 6.205613 | 3010.0452 | 58.29583 | 38.05534 | 233517.169 | 35308334 | 5.470022 | 47.44436 | 25.800912 | 6.853430 |
| Zimbabwe | 568.4003 | 0 | 34.96539 | 75.38534 | 89.17435 | 4.1831083 | 12220.991 | 872.68072 | 6862111111 | 50.108569 | 2.9879860 | 15.834904 | 663.1267 | 69.11006 | 30.79621 | 29285.350 | 9821180 | 11.745544 | 71.99439 | 22.276367 | 21.366488 |
Now there are only six rows, one for each country. This is probably too few to fit a model.
Instead of using the mean for each country, we decided to use all the samples and simply ignore that the data comes from different years. The resulting data frame should have one row per country and year and one column per indicator:
# reshape as described above
reshaped <- data.frame()
for (x in colnames(african_data[col_filter])) {
a <- data.frame('value' = african_data[,x], 'Country.Name' = african_data$Country.Name, 'Series.Name' = african_data$Series.Name, 'Year' = x)
b <- cast(a, Country.Name + Year ~ Series.Name)
reshaped <- rbind(reshaped, b)
}
# strip the "YR" prefix and convert the year to an integer
reshaped$Year <- strtoi(substr(reshaped$Year, 3,6))
# show data frame
knitr::kable(head(reshaped,10))
| Country.Name | Year | Adjusted net national income per capita (current US$) | Adjusted savings: net forest depletion (current US$) | Agricultural land (% of land area) | Agricultural methane emissions (% of total) | Agricultural nitrous oxide emissions (% of total) | Alternative and nuclear energy (% of total energy use) | CO2 emissions (kt) | Electric power consumption (kWh per capita) | Electricity production (kWh) | Forest area (% of land area) | Fuel exports (% of merchandise exports) | Fuel imports (% of merchandise imports) | GDP per capita (current US$) | Household final consumption expenditure, etc. (% of GDP) | Industry, value added (% of GDP) | Organic water pollutant (BOD) emissions (kg per day) | Population (Total) | Public spending on education, total (% of GDP) | Rural population (% of total population) | Tax revenue (% of GDP) | Terrestrial protected areas (% of total land area) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Botswana | 1972 | 204.3396 | 0 | 45.88075 | NA | NA | NA | 22.002 | NA | NA | NA | NA | NA | 223.2795 | 58.89064 | NA | NA | 740118 | 3.90639 | 90.5460 | NA | NA |
| Ethiopia | 1972 | NA | 108811240 | 54.06903 | NA | NA | 0.2163039 | 1408.128 | 18.71545 | 6.1100e+08 | NA | NA | NA | NA | NA | NA | NA | 30135531 | NA | 91.0632 | NA | NA |
| Ghana | 1972 | 210.1584 | 46094493 | 51.41953 | NA | NA | 9.0109350 | 2423.887 | 346.88367 | 3.3570e+09 | NA | 0.8078054 | 11.530897 | 232.5688 | 74.80287 | 19.85830 | NA | 9083737 | 4.62048 | 70.5936 | NA | NA |
| Morocco | 1972 | 275.8059 | 0 | 58.63243 | NA | NA | 5.1867821 | 8049.065 | 139.59813 | 2.6140e+09 | NA | 0.2067663 | 7.157340 | 304.1043 | 72.98565 | NA | NA | 16611970 | NA | 64.2282 | NA | NA |
| South Africa | 1972 | 753.3623 | 0 | 78.47316 | NA | NA | 0.1552646 | 171725.610 | 2405.27269 | 5.9518e+10 | NA | NA | NA | 897.3818 | 58.49684 | 37.72052 | NA | 23126276 | NA | 52.0710 | NA | NA |
| Zimbabwe | 1972 | 431.9784 | 0 | 30.76128 | NA | NA | 5.5611523 | 8225.081 | 839.00294 | 4.3310e+09 | NA | NA | NA | 480.4583 | 65.88515 | 31.21996 | NA | 5573282 | NA | 81.6336 | NA | NA |
| Botswana | 1973 | 291.3054 | 0 | 45.88075 | NA | NA | NA | 51.338 | NA | NA | NA | NA | NA | 318.8310 | 57.71614 | NA | NA | 765685 | 3.26279 | 89.7360 | NA | NA |
| Ethiopia | 1973 | NA | 193920877 | 54.02361 | NA | NA | 0.2243135 | 1752.826 | 17.75724 | 5.9100e+08 | NA | NA | NA | NA | NA | NA | NA | 31029594 | NA | 90.8888 | NA | NA |
| Ghana | 1973 | 236.3150 | 83488954 | 51.41953 | NA | NA | 9.8963503 | 2475.225 | 392.28747 | 3.9100e+09 | NA | 0.6826192 | 8.947857 | 263.6961 | 75.00857 | 20.22423 | NA | 9350286 | NA | 70.3764 | NA | NA |
| Morocco | 1973 | 330.9586 | 0 | 59.20905 | NA | NA | 3.4436158 | 9640.543 | 152.96533 | 2.8750e+09 | NA | 0.5393172 | 6.484498 | 366.5724 | 72.73912 | NA | NA | 16958091 | 4.67169 | 63.5808 | NA | NA |
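The same reshaping (one row per country and year, one column per indicator) can also be written without a loop. The following sketch assumes the tidyr package is installed and uses a small made-up data frame. It is an alternative formulation, not the code used in the project.

```r
library(tidyr)

# Toy data in the same wide-by-year layout as african_data
toy <- data.frame(
  Country.Name = c("A", "A", "B", "B"),
  Series.Name  = c("CO2", "GDP", "CO2", "GDP"),
  YR1972 = c(1, 10, 2, 20),
  YR1973 = c(3, NA, 4, 40)
)

# Step 1: stack the year columns into (Year, value) pairs, dropping the "YR" prefix
long <- pivot_longer(toy, cols = starts_with("YR"),
                     names_to = "Year", names_prefix = "YR",
                     values_to = "value")
long$Year <- as.integer(long$Year)

# Step 2: spread the indicators into one column each
wide <- pivot_wider(long, names_from = Series.Name, values_from = value)
print(wide)
```

Missing combinations stay NA in the result, just as in the loop-based version.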
The data now has the intended shape. We see that it contains many NA values, even though we filtered for countries that have all the indicators available.
# how many samples does the data frame have
nrow(reshaped)
## [1] 216
# how many non-NA values are there per country and column
grouper <- reshaped %>%
  group_by(Country.Name) %>%
  summarise(across(everything(), ~ sum(!is.na(.x))))
knitr::kable(grouper)
| Country.Name | Year | Adjusted net national income per capita (current US$) | Adjusted savings: net forest depletion (current US$) | Agricultural land (% of land area) | Agricultural methane emissions (% of total) | Agricultural nitrous oxide emissions (% of total) | Alternative and nuclear energy (% of total energy use) | CO2 emissions (kt) | Electric power consumption (kWh per capita) | Electricity production (kWh) | Forest area (% of land area) | Fuel exports (% of merchandise exports) | Fuel imports (% of merchandise imports) | GDP per capita (current US$) | Household final consumption expenditure, etc. (% of GDP) | Industry, value added (% of GDP) | Organic water pollutant (BOD) emissions (kg per day) | Population (Total) | Public spending on education, total (% of GDP) | Rural population (% of total population) | Tax revenue (% of GDP) | Terrestrial protected areas (% of total land area) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Botswana | 36 | 36 | 36 | 36 | 3 | 3 | 27 | 36 | 27 | 27 | 18 | 8 | 8 | 36 | 36 | 33 | 10 | 36 | 19 | 36 | 9 | 18 |
| Ethiopia | 36 | 27 | 36 | 36 | 3 | 3 | 36 | 36 | 36 | 36 | 18 | 8 | 13 | 27 | 27 | 27 | 18 | 36 | 13 | 36 | 17 | 18 |
| Ghana | 36 | 36 | 36 | 36 | 3 | 3 | 36 | 36 | 36 | 36 | 18 | 24 | 25 | 36 | 36 | 35 | 1 | 36 | 18 | 36 | 11 | 18 |
| Morocco | 36 | 36 | 36 | 36 | 3 | 3 | 36 | 36 | 36 | 36 | 18 | 36 | 36 | 36 | 36 | 28 | 8 | 36 | 29 | 36 | 6 | 18 |
| South Africa | 36 | 36 | 36 | 36 | 3 | 3 | 36 | 36 | 36 | 36 | 18 | 27 | 28 | 36 | 36 | 36 | 16 | 36 | 17 | 36 | 8 | 18 |
| Zimbabwe | 36 | 36 | 36 | 36 | 3 | 3 | 36 | 36 | 36 | 36 | 18 | 19 | 18 | 36 | 36 | 32 | 1 | 36 | 12 | 36 | 8 | 18 |
The table confirms the filtering above: for every country, at least one value is available for each indicator. There are 36 years of data, so 36 non-NA values is the best case for an indicator. This is the case for our target variable, CO2 emissions, and for some others. However, some indicators have hardly any entries, such as the agricultural methane and nitrous oxide emissions or the organic water pollutant emissions.
We will drop indicators with more than 10% of their values missing:
maxval <- length(reshaped$Country.Name)
check_na <- data.frame("#_of_missing_values"=colSums(is.na(reshaped))/maxval*100)
knitr::kable(check_na)
| | X._of_missing_values |
|---|---|
| Country.Name | 0.000000 |
| Year | 0.000000 |
| Adjusted net national income per capita (current US$) | 4.166667 |
| Adjusted savings: net forest depletion (current US$) | 0.000000 |
| Agricultural land (% of land area) | 0.000000 |
| Agricultural methane emissions (% of total) | 91.666667 |
| Agricultural nitrous oxide emissions (% of total) | 91.666667 |
| Alternative and nuclear energy (% of total energy use) | 4.166667 |
| CO2 emissions (kt) | 0.000000 |
| Electric power consumption (kWh per capita) | 4.166667 |
| Electricity production (kWh) | 4.166667 |
| Forest area (% of land area) | 50.000000 |
| Fuel exports (% of merchandise exports) | 43.518518 |
| Fuel imports (% of merchandise imports) | 40.740741 |
| GDP per capita (current US$) | 4.166667 |
| Household final consumption expenditure, etc. (% of GDP) | 4.166667 |
| Industry, value added (% of GDP) | 11.574074 |
| Organic water pollutant (BOD) emissions (kg per day) | 75.000000 |
| Population (Total) | 0.000000 |
| Public spending on education, total (% of GDP) | 50.000000 |
| Rural population (% of total population) | 0.000000 |
| Tax revenue (% of GDP) | 72.685185 |
| Terrestrial protected areas (% of total land area) | 50.000000 |
filter_na <- check_na %>% rownames_to_column('indicator') %>% filter(X._of_missing_values <=10 )
filter_na
## indicator
## 1 Country.Name
## 2 Year
## 3 Adjusted net national income per capita (current US$)
## 4 Adjusted savings: net forest depletion (current US$)
## 5 Agricultural land (% of land area)
## 6 Alternative and nuclear energy (% of total energy use)
## 7 CO2 emissions (kt)
## 8 Electric power consumption (kWh per capita)
## 9 Electricity production (kWh)
## 10 GDP per capita (current US$)
## 11 Household final consumption expenditure, etc. (% of GDP)
## 12 Population (Total)
## 13 Rural population (% of total population)
## X._of_missing_values
## 1 0.000000
## 2 0.000000
## 3 4.166667
## 4 0.000000
## 5 0.000000
## 6 4.166667
## 7 0.000000
## 8 4.166667
## 9 4.166667
## 10 4.166667
## 11 4.166667
## 12 0.000000
## 13 0.000000
# keep only the columns of reshaped whose share of missing values is at most 10%
reshaped <- reshaped[, names(reshaped) %in% filter_na$indicator]
knitr::kable(head(reshaped))
| Country.Name | Year | Adjusted net national income per capita (current US$) | Adjusted savings: net forest depletion (current US$) | Agricultural land (% of land area) | Alternative and nuclear energy (% of total energy use) | CO2 emissions (kt) | Electric power consumption (kWh per capita) | Electricity production (kWh) | GDP per capita (current US$) | Household final consumption expenditure, etc. (% of GDP) | Population (Total) | Rural population (% of total population) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Botswana | 1972 | 204.3396 | 0 | 45.88075 | NA | 22.002 | NA | NA | 223.2795 | 58.89064 | 740118 | 90.5460 |
| Ethiopia | 1972 | NA | 108811240 | 54.06903 | 0.2163039 | 1408.128 | 18.71545 | 6.1100e+08 | NA | NA | 30135531 | 91.0632 |
| Ghana | 1972 | 210.1584 | 46094493 | 51.41953 | 9.0109350 | 2423.887 | 346.88367 | 3.3570e+09 | 232.5688 | 74.80287 | 9083737 | 70.5936 |
| Morocco | 1972 | 275.8059 | 0 | 58.63243 | 5.1867821 | 8049.065 | 139.59813 | 2.6140e+09 | 304.1043 | 72.98565 | 16611970 | 64.2282 |
| South Africa | 1972 | 753.3623 | 0 | 78.47316 | 0.1552646 | 171725.610 | 2405.27269 | 5.9518e+10 | 897.3818 | 58.49684 | 23126276 | 52.0710 |
| Zimbabwe | 1972 | 431.9784 | 0 | 30.76128 | 5.5611523 | 8225.081 | 839.00294 | 4.3310e+09 | 480.4583 | 65.88515 | 5573282 | 81.6336 |
# used for imputation later:
var_num <- filter_na$indicator[3:nrow(filter_na)]
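The threshold filter above can be condensed into a few lines of base R. A minimal sketch on made-up data, where the data frame `toy` and the 10% cut-off expressed as the fraction `0.10` are the only assumptions:

```r
# Toy data with 20 rows: GDP is missing in 5% of rows, Tax in 25%
toy <- data.frame(
  Country.Name = rep(c("A", "B"), each = 10),
  Year = rep(1991:2000, times = 2),
  CO2 = 1:20,
  GDP = c(NA, 2:20),
  Tax = c(rep(NA, 5), 6:20)
)

# Fraction of missing values per column
share_missing <- colMeans(is.na(toy))

# Keep only columns with at most 10% missing values
keep <- names(share_missing)[share_missing <= 0.10]
filtered <- toy[, keep]
print(colnames(filtered))
```

Working with column names rather than positions keeps the selection valid even if the column order changes.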
Botswana, officially the Republic of Botswana, is a landlocked country in Southern Africa. It became independent within the Commonwealth on 30 September 1966. Since then, it has been a representative republic, with a consistent record of uninterrupted democratic elections and the lowest perceived corruption ranking in Africa since at least 1998. It is currently Africa’s oldest continuous democracy. Botswana is topographically flat, with up to 70 percent of its territory being the Kalahari Desert. It is bordered by South Africa to the south and southeast, Namibia to the west and north, and Zimbabwe to the northeast. Its border with Zambia to the north near Kazungula is poorly defined but is, at most, a few hundred metres long.
botswana = world %>%
filter(name_long == "Botswana", !is.na(iso_a2))
#tm_shape(africa) + tm_polygons("name_long")
tm_shape(africa) +
tm_fill("lightgrey") +
tm_borders() +
tm_text("name_long", size = 0.3) +
tm_shape(botswana) +
tm_fill("darkgreen") +
tm_text("name_long", size = 0.5) +
tm_layout(frame = FALSE, title = "Location of Botswana in Africa", title.size = 1, title.position = c(x = 0.42, y = 0.98))
A mid-sized country of just over 2.3 million people, Botswana is one of the most sparsely populated countries in the world. Around 10 percent of the population lives in the capital and largest city, Gaborone. Formerly one of the poorest countries in the world, with a GDP per capita of about US$70 per year in the late 1960s, Botswana has since transformed itself into an upper middle income country, with one of the world’s fastest-growing economies. The economy is dominated by mining, cattle, and tourism. Botswana boasts a GDP (purchasing power parity) per capita of about US$18,825 per year as of 2015, which is one of the highest in Africa. Its high gross national income (by some estimates the fourth-largest in Africa) gives the country a relatively high standard of living and the highest Human Development Index of continental Sub-Saharan Africa.
Botswana is a member of the African Union, the Southern African Development Community, the Commonwealth of Nations, and the United Nations.
p1 <- ggdraw() +
draw_image("img/botswana_city.jpg", scale = 0.9) +
draw_label("Botswana City", color = "Grey", size = 20, angle = 0, x = 0.5, y = 0.82)
p2 <- ggdraw() +
draw_image("img/botswana_rural.jpg", scale = 0.9) +
draw_label("Botswana Rural", color = "Grey", size = 20, angle = 0, x = 0.5, y = 0.82)
plot_grid(p1, p2)
Ethiopia, officially the Federal Democratic Republic of Ethiopia, is a landlocked country in the Horn of Africa. It shares borders with Eritrea to the north, Djibouti to the northeast, Somalia to the east, Kenya to the south, South Sudan to the west and Sudan to the northwest. With over 109 million inhabitants as of 2019, Ethiopia is the most populous landlocked country in the world and the second-most populous nation on the African continent. The country has a total area of 1,100,000 square kilometres. Its capital and largest city is Addis Ababa, which lies a few miles west of the East African Rift that splits the country into the Nubian and Somali tectonic plates. Ethiopian national identity is grounded in the indigenous Amharic language, the historic and contemporary roles of Christianity and Islam, and the independence of Ethiopia from foreign rule, stemming from the various ancient Ethiopian kingdoms of antiquity.
ethiopia = world %>%
filter(name_long == "Ethiopia", !is.na(iso_a2))
tm_shape(africa) +
tm_fill("lightgrey") +
tm_borders() +
tm_text("name_long", size = 0.3) +
tm_shape(ethiopia) +
tm_fill("darkgreen") +
tm_text("name_long", size = 0.5) +
tm_layout(frame = FALSE, title = "Location of Ethiopia in Africa", title.size = 1, title.position = c(x = 0.42, y = 0.98))
According to the IMF, Ethiopia was one of the fastest growing economies in the world, registering over 10% economic growth from 2004 through 2009. It was the fastest-growing non-oil-dependent African economy in the years 2007 and 2008. In 2015, the World Bank highlighted that Ethiopia had witnessed rapid economic growth with real gross domestic product (GDP) growth averaging 10.9% between 2004 and 2014.
In 2008 and 2011, Ethiopia’s growth performance and considerable development gains were challenged by high inflation and a difficult balance of payments situation. Inflation surged to 40% in August 2011 because of loose monetary policy, a large civil service wage increase in early 2011, and high food prices. For 2011/12, end-year inflation was projected to be about 22%, and single-digit inflation was projected for 2012/13 with the implementation of tight monetary and fiscal policies.
In spite of fast growth in recent years, GDP per capita is one of the lowest in the world, and the economy faces a number of serious structural problems. However, with a focused investment in public infrastructure and industrial parks, Ethiopia’s economy is addressing its structural problems to become a hub for light manufacturing in Africa. In 2019, a law was passed allowing expatriate Ethiopians to invest in Ethiopia’s financial service industry.
p1 <- ggdraw() +
draw_image("img/ethiopia_city.jpg", scale = 0.9) +
draw_label("Ethiopia City", color = "Grey", size = 20, angle = 0, x = 0.5, y = 0.82)
p2 <- ggdraw() +
draw_image("img/ethiopia_rural.jpg", scale = 0.9) +
draw_label("Ethiopia Rural", color = "Grey", size = 20, angle = 0, x = 0.5, y = 0.82)
plot_grid(p1, p2)
Ghana, officially the Republic of Ghana, is a country located along the Gulf of Guinea and Atlantic Ocean, in the subregion of West Africa. Spanning a land mass of 238,535 km², Ghana is bordered by the Ivory Coast in the west, Burkina Faso in the north, Togo in the east, and the Gulf of Guinea and Atlantic Ocean in the south. Ghana means “Warrior King” in the Soninke language.
Ghana’s population of approximately 30 million spans a variety of ethnic, linguistic and religious groups. According to the 2010 census, 71.2% of the population was Christian, 17.6% was Muslim, and 5.2% practised traditional faiths. Its diverse geography and ecology ranges from coastal savannahs to tropical rain forests.
Ghana is a unitary constitutional democracy led by a president who is both head of state and head of the government. Ghana’s growing economic prosperity and democratic political system have made it a regional power in West Africa. It is a member of the Non-Aligned Movement, the African Union, the Economic Community of West African States (ECOWAS), Group of 24 (G24) and the Commonwealth of Nations.
ghana = world %>%
filter(name_long == "Ghana", !is.na(iso_a2))
tm_shape(africa) +
tm_fill("lightgrey") +
tm_borders() +
tm_text("name_long", size = 0.3) +
tm_shape(ghana) +
tm_fill("darkgreen") +
tm_text("name_long", size = 0.5) +
tm_layout(frame = FALSE, title = "Location of Ghana in Africa", title.size = 1, title.position = c(x = 0.42, y = 0.98))
Ghana is moderately rich in natural resources, possessing industrial minerals, hydrocarbons and precious metals. It is an emerging digital economy with a mixed-economy structure and an emerging market that recorded 8.7% GDP growth in 2012. Its economic plan, known as the “Ghana Vision 2020”, envisions Ghana as the first African country to become a developed country between 2020 and 2029 and a newly industrialised country between 2030 and 2039. This excludes fellow Group of 24 member and Sub-Saharan African country South Africa, which is already a newly industrialised country. Ghana’s economy also has ties to the Chinese yuan renminbi along with Ghana’s vast gold reserves. In 2013, the Bank of Ghana began circulating the renminbi throughout Ghanaian state-owned banks and to the Ghanaian public as hard currency, alongside the national Ghana cedi, as a second national trade currency. Between 2012 and 2013, 37.9 percent of rural dwellers were experiencing poverty, whereas only 10.6 percent of urban dwellers were. Urban areas hold greater opportunity for employment, particularly in informal trade, while nearly all (94 percent) of rural poor households participate in the agricultural sector.
p1 <- ggdraw() +
draw_image("img/ghana_city.jpg", scale = 0.9) +
draw_label("Ghana City", color = "Grey", size = 20, angle = 0, x = 0.5, y = 0.82)
p2 <- ggdraw() +
draw_image("img/ghana_rural.jpg", scale = 0.9) +
draw_label("Ghana Rural", color = "Grey", size = 20, angle = 0, x = 0.5, y = 0.82)
plot_grid(p1, p2)
Morocco, officially the Kingdom of Morocco, is a country located in the Maghreb region of North Africa. It overlooks the Mediterranean Sea to the north and the Atlantic Ocean to the west, with land borders with Algeria to the east and Western Sahara to the south (status disputed). Morocco also claims the exclaves of Ceuta, Melilla and Peñón de Vélez de la Gomera, all of them under Spanish jurisdiction, as well as several small Spanish-controlled islands off its coast. The capital is Rabat and the largest city is Casablanca. Morocco spans an area of 710,850 km² (274,460 sq mi) and has a population of over 36 million.
morocco = world %>%
filter(name_long == "Morocco", !is.na(iso_a2))
tm_shape(africa) +
tm_fill("lightgrey") +
tm_borders() +
tm_text("name_long", size = 0.3) +
tm_shape(morocco) +
tm_fill("darkgreen") +
tm_text("name_long", size = 0.5) +
tm_layout(frame = FALSE, title = "Location of Morocco in Africa", title.size = 1, title.position = c(x = 0.42, y = 0.98))
Morocco’s economy is considered a relatively liberal economy governed by the law of supply and demand. Since 1993, the country has followed a policy of privatisation of certain economic sectors which used to be in the hands of the government. Morocco has become a major player in African economic affairs, and is the fifth-largest African economy by GDP (PPP). Morocco was ranked as the first African country by the Economist Intelligence Unit’s quality-of-life index, ahead of South Africa. However, in the years since that first-place ranking was given, Morocco has slipped into fourth place behind Egypt.
Government reforms and steady yearly growth in the region of 4–5% from 2000 to 2007, including 4.9% year-on-year growth in 2003–2007, helped the Moroccan economy to become much more robust compared to a few years earlier. For 2012, the World Bank forecast a growth rate of 4% for Morocco, and 4.2% for the following year, 2013.
The services sector accounts for just over half of GDP, and industry, made up of mining, construction and manufacturing, accounts for an additional quarter. The industries that recorded the highest growth are tourism, telecoms, information technology, and textiles.
p1 <- ggdraw() +
draw_image("img/morocco_city.jpg", scale = 0.9) +
draw_label("Morocco City", color = "Grey", size = 20, angle = 0, x = 0.5, y = 0.82)
p2 <- ggdraw() +
draw_image("img/morocco_rural.jpg", scale = 0.9) +
draw_label("Morocco Rural", color = "Grey", size = 20, angle = 0, x = 0.5, y = 0.82)
plot_grid(p1, p2)
South Africa, officially the Republic of South Africa (RSA), is the southernmost country in Africa. With over 58 million people, it is the world’s 24th-most populous nation and covers an area of 1,221,037 square kilometres. South Africa has three designated capital cities: executive Pretoria, judicial Bloemfontein and legislative Cape Town. The largest city is Johannesburg. About 80% of South Africans are of Bantu ancestry, divided among a variety of ethnic groups speaking different African languages. The remaining population consists of Africa’s largest communities of European, Asian, Indian, and multiracial ancestry.
It is bounded to the south by 2,798 kilometres of coastline of Southern Africa stretching along the South Atlantic and Indian Oceans; to the north by the neighbouring countries of Namibia, Botswana, and Zimbabwe; and to the east and northeast by Mozambique and Eswatini (former Swaziland); and it surrounds the enclaved country of Lesotho. It is the southernmost country on the mainland of the Old World or the Eastern Hemisphere, and the most populous country located entirely south of the equator.
south_africa = world %>%
filter(name_long == "South Africa", !is.na(iso_a2))
tm_shape(africa) +
tm_fill("lightgrey") +
tm_borders() +
tm_text("name_long", size = 0.3) +
tm_shape(south_africa) +
tm_fill("darkgreen") +
tm_text("name_long", size = 0.5) +
tm_layout(frame = FALSE, title = "Location of South Africa", title.size = 1, title.position = c(x = 0.42, y = 0.98))
South Africa is a multiethnic society encompassing a wide variety of cultures, languages, and religions. Its pluralistic makeup is reflected in the constitution’s recognition of 11 official languages, the fourth-highest number in the world. Two are of European origin: Afrikaans developed from Dutch and serves as the first language of most coloured and white South Africans; English reflects the legacy of British colonialism, and is commonly used in public and commercial life, though it is fourth-ranked as a spoken first language. The country is one of the few in Africa never to have had a coup d’état, and regular elections have been held for almost a century. However, the vast majority of black South Africans were not enfranchised until 1994.
South Africa is a developing country and ranks 113th on the Human Development Index, the seventh-highest in Africa. It has been classified by the World Bank as a newly industrialised country, with the second-largest nominal GDP in Africa, and the 33rd-largest in the world. The country is a middle power in international affairs; it maintains significant regional influence and is a member of the G20. However, crime, poverty and inequality remain widespread, with about a quarter of the population unemployed and living on less than US$1.25 a day.
p1 <- ggdraw() +
draw_image("img/south_africa_city.jpg", scale = 0.9) +
draw_label("South Africa City", color = "Grey", size = 20, angle = 0, x = 0.5, y = 0.82)
p2 <- ggdraw() +
draw_image("img/south_africa_rural.jpg", scale = 0.9) +
draw_label("South Africa Rural", color = "Grey", size = 20, angle = 0, x = 0.5, y = 0.82)
plot_grid(p1, p2)
Zimbabwe, officially the Republic of Zimbabwe, formerly Rhodesia, is a landlocked country located in Southern Africa, between the Zambezi and Limpopo Rivers, bordered by South Africa, Botswana, Zambia and Mozambique. The capital and largest city is Harare. The second largest city is Bulawayo. A country of roughly 14 million people, Zimbabwe has 16 official languages, with English, Shona, and Ndebele the most common.
Robert Mugabe became Prime Minister of Zimbabwe in 1980, when his ZANU–PF party won the elections following the end of white minority rule; he was the President of Zimbabwe from 1987 until his resignation in 2017. Under Mugabe’s authoritarian regime, the state security apparatus dominated the country and was responsible for widespread human rights violations. Mugabe maintained the revolutionary socialist rhetoric of the Cold War era, blaming Zimbabwe’s economic woes on conspiring Western capitalist countries. Contemporary African political leaders were reluctant to criticise Mugabe, who was burnished by his anti-imperialist credentials, though Archbishop Desmond Tutu called him “a cartoon figure of an archetypal African dictator”. The country has been in economic decline since the 1990s, experiencing several crashes and hyperinflation along the way.
zimbabwe = world %>%
filter(name_long == "Zimbabwe", !is.na(iso_a2))
tm_shape(africa) +
tm_fill("lightgrey") +
tm_borders() +
tm_text("name_long", size = 0.3) +
tm_shape(zimbabwe) +
tm_fill("darkgreen") +
tm_text("name_long", size = 0.5) +
tm_layout(frame = FALSE, title = "Location of Zimbabwe", title.size = 1, title.position = c(x = 0.42, y = 0.98))
Minerals, gold, and agriculture are the main foreign exports of Zimbabwe. Tourism also plays a key role in its economy.
p1 <- ggdraw() +
draw_image("img/zimbabwe_city.jpg", scale = 0.9) +
draw_label("Zimbabwe City", color = "Grey", size = 20, angle = 0, x = 0.5, y = 0.82)
p2 <- ggdraw() +
draw_image("img/zimbabwe_rural.jpg", scale = 0.9) +
draw_label("Zimbabwe Rural", color = "Grey", size = 20, angle = 0, x = 0.5, y = 0.82)
plot_grid(p1, p2)
With the adapted data frame for our project, the main task is to understand the data better and to find correlations before fitting a model.
ggplot(reshaped, aes(x = Year, y = `Population (Total)`, color = Country.Name)) +
geom_point() +
ggtitle('Population over Time') +
xlab('Time [years]') +
ylab('Population per Country')
ggplot(reshaped, aes(x = Year, y = `CO2 emissions (kt)`, color = Country.Name)) +
geom_point() +
ggtitle('CO2 Emissions over Time') +
xlab('Time [years]') +
ylab('CO2 emissions [kt] per Country')
ggplot(reshaped, aes(x = Year, y = `CO2 emissions (kt)`/`Population (Total)`, color = Country.Name)) +
geom_point() +
ggtitle('CO2 Emissions normalized over Time') +
xlab('Time [years]') +
ylab('CO2 emissions [kt] per capita')
ggplot(reshaped, aes(x = Year, y = `GDP per capita (current US$)`, color = Country.Name)) +
geom_point() +
ggtitle('GDP per capita over Time') +
xlab('Time [years]') +
ylab('GDP per capita (current US$)')
ggplot(reshaped, aes(x = Year, y = `Household final consumption expenditure, etc. (% of GDP)`, color = Country.Name)) +
geom_point() +
ggtitle('Household final consumption expenditure') +
xlab('Time [years]') +
ylab('Household final consumption expenditure, etc. (% of GDP)')
The plots above give a first impression of the data by plotting them over time. We can see that the population is growing in every country over the years, with the highest growth in Ethiopia. The second and third plots show the CO2 emissions, also over time. For the third plot, the emissions were normalized by the population of the country. As a result, we see for example in South Africa that the absolute emissions are growing very fast; viewed relative to the population growth, however, they reached a peak around 1985 and then declined slightly. So it might be a good approach to scale the data to the population of the country.
In most of the countries, we see a high increase of GDP per capita after 2000. The household final consumption expenditure stays quite stable for most countries over time, except for Botswana and Zimbabwe.
ggplot(reshaped, aes(x = `CO2 emissions (kt)`, y = `GDP per capita (current US$)`, color = Country.Name)) +
geom_point() +
ggtitle('GDP per Capita over CO2 emissions')+
xlab('CO2 emissions (kt)') +
ylab('GDP per capita (US$)')
The plot above shows that there seems to be a linear relationship between the GDP per capita and the CO2 emissions. A high GDP might indicate more industry as well as a higher wealth in the country, which would be a reason for higher emissions.
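To back up the visual impression with a number, the Pearson correlation can be computed over the pooled data and within each country. The following is a minimal sketch on hypothetical toy values; the data frame `toy` and its columns are assumptions for illustration and are not part of the project data. A pooled correlation can be driven mainly by level differences between countries, so the within-country values are worth checking as well.

```r
# Toy data (hypothetical values, not taken from the World Bank data set)
toy <- data.frame(
  country = rep(c("A", "B"), each = 5),
  co2_kt  = c(10, 12, 15, 19, 24, 100, 110, 130, 160, 200),
  gdp_pc  = c(200, 230, 310, 390, 480, 900, 950, 1200, 1500, 1900)
)

# Pearson correlation over all rows, ignoring the country
pooled_r <- cor(toy$co2_kt, toy$gdp_pc)

# Pearson correlation computed separately for each country
within_r <- sapply(
  split(toy, toy$country),
  function(d) cor(d$co2_kt, d$gdp_pc)
)

print(pooled_r)
print(within_r)
```

With the project data, the same pattern could be applied to `reshaped`, splitting by `Country.Name` and passing `use = "complete.obs"` to `cor()` as long as missing values are still present.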
Simple modelling with the help of tidyverse and modelr. Here we take the reshaped (transposed) african_data and assign it to the variable reshaped_african.
reshaped_african <- reshaped
When working with a large dataset, a common approach is to impute missing values with the median, as it is robust to outliers. In this project, we replaced every missing value of a numeric indicator with the median of that indicator, computed across all countries and years.
library(data.table)
# iterate through the numeric columns in the dataset and impute the median
for (k in names(reshaped_african)) {
  if (k %in% var_num) {
    # impute missing values with the column median (across all countries and years)
    med <- median(reshaped_african[[k]], na.rm = TRUE)
    set(x = reshaped_african, i = which(is.na(reshaped_african[[k]])), j = k, value = med)
  }
}
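The effect of a single global median can be seen on a small example. The sketch below uses hypothetical toy values (the data frame `toy` is an assumption, not project data) and compares the global median used above with a per-country median, which is shown only as a possible alternative. When countries differ strongly in level, the global median can be far from the typical values of an individual country.

```r
# Hypothetical toy data: two countries with very different levels
toy <- data.frame(
  country = rep(c("A", "B"), each = 4),
  value   = c(1, 2, NA, 3, 100, NA, 120, 110)
)

# Global median imputation: one median over all countries
global <- toy$value
global[is.na(global)] <- median(toy$value, na.rm = TRUE)

# Per-country median imputation: one median per country (alternative)
by_country <- ave(toy$value, toy$country, FUN = function(v) {
  v[is.na(v)] <- median(v, na.rm = TRUE)
  v
})

print(cbind(toy, global, by_country))
```

In this toy case, the missing value of country A is filled with 51.5 under the global approach and with 2 under the per-country approach.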
If we check here, our dataset does not have missing values anymore.
indicator <- names(reshaped_african)
check_na <- data.frame("#_of_missing_values"=colSums(is.na(reshaped_african)))
check_na
## X._of_missing_values
## Country.Name 0
## Year 0
## Adjusted net national income per capita (current US$) 0
## Adjusted savings: net forest depletion (current US$) 0
## Agricultural land (% of land area) 0
## Alternative and nuclear energy (% of total energy use) 0
## CO2 emissions (kt) 0
## Electric power consumption (kWh per capita) 0
## Electricity production (kWh) 0
## GDP per capita (current US$) 0
## Household final consumption expenditure, etc. (% of GDP) 0
## Population (Total) 0
## Rural population (% of total population) 0
The above are the final indicators that will be used for the analysis. Before we proceed, we pre-process the data set further.
To make the values comparable between the countries, some indicators still need to be normalized. The population of the country is used as the basis for normalization. Only indicators that are not already expressed in per cent or per capita are normalized in this step.
# normalize data to population of country
reshaped_african$`Adjusted savings: net forest depletion per capita (current US$)` <-
reshaped_african$`Adjusted savings: net forest depletion (current US$)`/reshaped_african$`Population (Total)`
reshaped_african$`CO2 emissions per capita (kt)` <-
reshaped_african$`CO2 emissions (kt)`/reshaped_african$`Population (Total)`
reshaped_african$`Electricity production per capita (kWh)` <-
reshaped_african$`Electricity production (kWh)`/reshaped_african$`Population (Total)`
# delete original columns
reshaped_african <- reshaped_african %>% dplyr::select(-c(`Adjusted savings: net forest depletion (current US$)`,`CO2 emissions (kt)`,`Electricity production (kWh)` ))
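Dividing by the population does not change the unit of the numerator, so the new CO2 column is in kilotonnes per person, which gives very small numbers. As a quick sanity check, the sketch below uses the unimputed South Africa values for 1972 from the first table and also converts the result to tonnes per person; the variable names are chosen for illustration only.

```r
# Values for South Africa in 1972, taken from the table of raw indicators
co2_kt     <- 171725.610  # CO2 emissions in kilotonnes
population <- 23126276    # total population

# Per-capita value as computed in the normalization step (unit: kt per person)
co2_kt_per_capita <- co2_kt / population

# The same quantity in tonnes per person (1 kt = 1000 t)
co2_t_per_capita <- co2_kt_per_capita * 1000

print(co2_kt_per_capita)
print(co2_t_per_capita)
```

The result is roughly 7.4 tonnes of CO2 per person, which is easier to interpret than the value in kilotonnes.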
We scale and center the data so that they are all in similar magnitudes.
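# scale and center all indicator columns (every column except Country.Name and Year);
# a lambda formula is used because funs() is deprecated in dplyr
reshaped_african <- reshaped_african %>%
  mutate_at(names(reshaped_african)[3:length(reshaped_african)], ~ c(scale(.)))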
knitr::kable(summary(reshaped_african))
| Country.Name | Year | Adjusted net national income per capita (current US$) | Agricultural land (% of land area) | Alternative and nuclear energy (% of total energy use) | Electric power consumption (kWh per capita) | GDP per capita (current US$) | Household final consumption expenditure, etc. (% of GDP) | Population (Total) | Rural population (% of total population) | Adjusted savings: net forest depletion per capita (current US$) | CO2 emissions per capita (kt) | Electricity production per capita (kWh) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Botswana :36 | Min. :1972 | Min. :-0.9380 | Min. :-1.53901 | Min. :-0.9199 | Min. :-0.7546 | Min. :-0.9025 | Min. :-2.683718 | Min. :-1.2105 | Min. :-1.58518 | Min. :-0.6547 | Min. :-0.6933 | Min. :-0.6843 |
| Ethiopia :36 | 1st Qu.:1981 | 1st Qu.:-0.6945 | 1st Qu.:-0.56926 | 1st Qu.:-0.8027 | 1st Qu.:-0.5790 | 1st Qu.:-0.6896 | 1st Qu.:-0.457847 | 1st Qu.:-0.7211 | 1st Qu.:-0.86623 | 1st Qu.:-0.6547 | 1st Qu.:-0.6154 | 1st Qu.:-0.5284 |
| Ghana :36 | Median :1990 | Median :-0.4062 | Median :-0.07259 | Median :-0.3102 | Median :-0.3846 | Median :-0.4299 | Median : 0.007807 | Median :-0.1947 | Median :-0.05628 | Median :-0.6342 | Median :-0.3841 | Median :-0.4434 |
| Morocco :36 | Mean :1990 | Mean : 0.0000 | Mean : 0.00000 | Mean : 0.0000 | Mean : 0.0000 | Mean : 0.0000 | Mean : 0.000000 | Mean : 0.0000 | Mean : 0.00000 | Mean : 0.0000 | Mean : 0.0000 | Mean : 0.0000 |
| South Africa:36 | 3rd Qu.:1998 | 3rd Qu.: 0.3074 | 3rd Qu.: 0.80824 | 3rd Qu.: 0.3644 | 3rd Qu.:-0.1167 | 3rd Qu.: 0.3077 | 3rd Qu.: 0.697468 | 3rd Qu.: 0.5321 | 3rd Qu.: 0.88930 | 3rd Qu.: 0.4839 | 3rd Qu.:-0.1705 | 3rd Qu.:-0.2548 |
| Zimbabwe :36 | Max. :2007 | Max. : 3.5011 | Max. : 1.67768 | Max. : 2.8722 | Max. : 2.8113 | Max. : 3.5316 | Max. : 2.618432 | Max. : 3.1996 | Max. : 1.71421 | Max. : 4.9175 | Max. : 2.6691 | Max. : 3.4792 |
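The means of exactly zero in the summary follow directly from how `scale()` works: each column is centered on its mean and divided by its sample standard deviation, computed over all rows (that is, pooled over all countries). The sketch below checks this on a hypothetical vector `x`, which is an assumption for illustration.

```r
# Hypothetical values standing in for one indicator column
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# z-score computed by hand: subtract the mean, divide by the sample sd
z_manual <- (x - mean(x)) / sd(x)

# the same via scale(); c() drops the matrix attributes that scale() returns
z_scale <- c(scale(x))

print(all.equal(z_manual, z_scale))
print(mean(z_scale))  # numerically zero
print(sd(z_scale))    # one
```

Because the scaling is pooled, a scaled value expresses how far an observation lies from the mean of all six countries, not from the mean of its own country.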
We finally have a normalized dataset that is free of missing values. The next step is to see how the indicators of these African countries are correlated with each other.
library(corrplot)
options(repr.plot.width = 20, repr.plot.height = 15)
# keep only the numeric columns for the correlation matrix
reshaped_african_corr <- reshaped_african[sapply(reshaped_african, is.numeric)]
corr <- cor(reshaped_african_corr)
# store the result under a name that does not mask the corrplot() function
corr_plot <- corrplot(corr, method = "number")
Our selection of the column names:
colnames(reshaped_african_corr)
## [1] "Year"
## [2] "Adjusted net national income per capita (current US$)"
## [3] "Agricultural land (% of land area)"
## [4] "Alternative and nuclear energy (% of total energy use)"
## [5] "Electric power consumption (kWh per capita)"
## [6] "GDP per capita (current US$)"
## [7] "Household final consumption expenditure, etc. (% of GDP)"
## [8] "Population (Total)"
## [9] "Rural population (% of total population)"
## [10] "Adjusted savings: net forest depletion per capita (current US$)"
## [11] "CO2 emissions per capita (kt)"
## [12] "Electricity production per capita (kWh)"
The indicators appear to be quite strongly correlated. A few examples:
“Adjusted net national income per capita (current US$)” is highly positively correlated with “CO2 emissions (kt)”, “Electric power consumption (kWh per capita)” and “Electricity production (kWh)”.
“CO2 emissions (kt)” is also highly positively correlated with “Electric power consumption (kWh per capita)”, “Electricity production (kWh)”, “GDP per capita (current US$)”, “CO2 emissions per capita (kt)” and “Electricity production per capita (kWh)”.
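Instead of reading such pairs off the plot, they can be extracted from a correlation matrix programmatically. The following is a minimal sketch on hypothetical toy columns `a`, `b` and `c`; the threshold of 0.8 is an arbitrary assumption and not a value used in this project.

```r
# Hypothetical toy data: b is almost a multiple of a, c is unrelated
toy <- data.frame(
  a = 1:10,
  b = 2 * (1:10) + c(0.1, -0.2, 0.1, 0.0, 0.2, -0.1, 0.0, 0.1, -0.2, 0.1),
  c = c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3)
)

corr <- cor(toy)
threshold <- 0.8  # assumed cut-off for a "high" correlation

# Use the upper triangle only so that each pair is listed once
idx <- which(upper.tri(corr) & abs(corr) > threshold, arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(corr)[idx[, "row"]],
  var2 = colnames(corr)[idx[, "col"]],
  r    = corr[idx]
)

print(high_pairs)
```

Applied to the `corr` matrix computed above, the same code would list every pair of indicators whose absolute correlation exceeds the chosen threshold.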
Let us see how these few indicators are distributed all over the dataset.
national_income <- ggplot(reshaped_african) +
geom_density(aes(`Adjusted net national income per capita (current US$)`))
ep_consumtion <- ggplot(reshaped_african) +
geom_density(aes(`Electric power consumption (kWh per capita)`))
forest_depletion <- ggplot(reshaped_african) +
geom_density(aes(`Adjusted savings: net forest depletion per capita (current US$)`))
gdp_per_capita <- ggplot(reshaped_african) +
geom_density(aes(`GDP per capita (current US$)`))
co2_per_capita <- ggplot(reshaped_african) +
geom_density(aes(`CO2 emissions per capita (kt)`))
AgriLand <- ggplot(reshaped_african) +
geom_density(aes(`Agricultural land (% of land area)`))
household_exp <- ggplot(reshaped_african) +
geom_density(aes(`Household final consumption expenditure, etc. (% of GDP)`))
grid.arrange(national_income, ep_consumtion, forest_depletion, co2_per_capita, AgriLand, household_exp)
Here we plotted six of the eleven economic indicators of the African countries. Most of these indicators are right-skewed, except for agricultural land and household final consumption expenditure.
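The visual impression of skewness can be quantified with the moment-based sample skewness, which is positive for right-skewed data and close to zero for symmetric data. The helper `skewness()` below is written for illustration in base R and the two vectors are hypothetical; it is a sketch, not part of the original analysis.

```r
# Moment-based sample skewness: third central moment divided by
# the second central moment raised to the power 3/2
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Hypothetical examples
right_skewed <- c(1, 1, 2, 2, 3, 4, 20)  # long tail to the right
symmetric    <- c(-3, -1, 0, 1, 3)       # mirrored around zero

print(skewness(right_skewed))
print(skewness(symmetric))
```

Applied to the indicator columns, for example with `sapply(reshaped_african_corr, skewness)`, this would give one skewness value per indicator.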
knitr::kable(head(reshaped_african))
| Country.Name | Year | Adjusted net national income per capita (current US$) | Agricultural land (% of land area) | Alternative and nuclear energy (% of total energy use) | Electric power consumption (kWh per capita) | GDP per capita (current US$) | Household final consumption expenditure, etc. (% of GDP) | Population (Total) | Rural population (% of total population) | Adjusted savings: net forest depletion per capita (current US$) | CO2 emissions per capita (kt) | Electricity production per capita (kWh) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Botswana | 1972 | -0.8156565 | -0.5468912 | -0.3101774 | -0.3845575 | -0.8151440 | -0.4295774 | -1.2105083 | 1.6809011 | -0.6546686 | -0.6932936 | 3.4792487 |
| Ethiopia | 1972 | -0.4062387 | -0.0265650 | -0.8471225 | -0.7519478 | -0.4299202 | 0.0078072 | 0.4160475 | 1.7142072 | -0.3257537 | -0.6877591 | -0.6821736 |
| Ghana | 1972 | -0.8100608 | -0.1949281 | 2.1101067 | -0.5200520 | -0.8079691 | 0.6587629 | -0.7488254 | 0.3960286 | -0.1924226 | -0.6160943 | -0.4988118 |
| Morocco | 1972 | -0.7469308 | 0.2634178 | 0.8242203 | -0.6665277 | -0.7527157 | 0.5344715 | -0.3322607 | -0.0138833 | -0.6546686 | -0.5452161 | -0.6102112 |
| South Africa | 1972 | -0.2876881 | 1.5242023 | -0.8676472 | 0.9344819 | -0.2944736 | -0.4565120 | 0.0281997 | -0.7967691 | -0.6546686 | 1.7146694 | 0.6582275 |
| Zimbabwe | 1972 | -0.5967474 | -1.5076627 | 0.9501038 | -0.1723023 | -0.6165015 | 0.0488226 | -0.9430717 | 1.1069703 | -0.6546686 | -0.2224742 | -0.2848697 |
Before we proceed to modelling, we first rename the indicators so that we do not run into problems with the long column names. The new names of the indicators are listed below:
adj_net_national_income : Adjusted net national income per capita (current US$)
adj_savings_net_forest_deplet : Adjusted savings: net forest depletion (current US$)
agri_land : Agricultural land (% of land area)
alter_and_nuclear_energy : Alternative and nuclear energy (% of total energy use)
elect_power_consump : Electric power consumption (kWh per capita)
gdp_per_cap : GDP per capita (current US$)
household_consump_expend : Household final consumption expenditure, etc. (% of GDP)
total_pop : Population (Total)
rural_pop : Rural population (% of total population)
adj_savings_per_cap : Adjusted savings: net forest depletion per capita (current US$)
co2_per_cap : CO2 emissions per capita (kt)
elect_prod_per_cap : Electricity production per capita (kWh)
print(names(reshaped_african))
## [1] "Country.Name"
## [2] "Year"
## [3] "Adjusted net national income per capita (current US$)"
## [4] "Agricultural land (% of land area)"
## [5] "Alternative and nuclear energy (% of total energy use)"
## [6] "Electric power consumption (kWh per capita)"
## [7] "GDP per capita (current US$)"
## [8] "Household final consumption expenditure, etc. (% of GDP)"
## [9] "Population (Total)"
## [10] "Rural population (% of total population)"
## [11] "Adjusted savings: net forest depletion per capita (current US$)"
## [12] "CO2 emissions per capita (kt)"
## [13] "Electricity production per capita (kWh)"
names(reshaped_african) <- c("country", "Year", "adj_net_national_income", "agri_land", "alter_and_nuclear_energy",
"elect_power_consump", "gdp_per_cap", "household_consump_expend", "total_pop", "rural_pop",
"adj_savings_per_cap", "co2_per_cap", "elect_prod_per_cap")
print(names(reshaped_african))
## [1] "country" "Year"
## [3] "adj_net_national_income" "agri_land"
## [5] "alter_and_nuclear_energy" "elect_power_consump"
## [7] "gdp_per_cap" "household_consump_expend"
## [9] "total_pop" "rural_pop"
## [11] "adj_savings_per_cap" "co2_per_cap"
## [13] "elect_prod_per_cap"
Since most of the indicators are not normally distributed, this could affect our analysis if we are not careful. After filtering the most influential indicators for the African countries, we are left with 216 observations (six countries with 36 yearly observations each). If we dropped the outlier values, we would lose information about the variability in the data. One option to treat these outliers is to replace them with the median. Another option is to use a model that is robust to outliers. We use a random forest model. It is based on decision trees as base learners, which isolate observations into small leaves. In the case of regression, the model within each leaf is generally of very low order (usually only the average of the observations in the leaf). Therefore, extreme values do not affect the entire model, because they are averaged locally, and the fit to the other values is largely unaffected.
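The local-averaging argument can be illustrated without fitting a forest. In the sketch below, fixed bins stand in for the leaves of a single regression tree; this is a deliberate simplification and not how `randomForest` chooses its splits, and all values are hypothetical. An injected outlier changes only the mean of its own bin, whereas the slope of a global linear fit shifts.

```r
# Hypothetical data: y rises in three steps with x
x <- 1:12
y <- c(1.0, 1.2, 0.9, 1.1, 2.0, 2.1, 1.9, 2.2, 3.0, 3.1, 2.9, 3.2)

# Copy of y with one extreme value injected at the last observation
y_out <- y
y_out[12] <- 30

# Fixed bins as a stand-in for the leaves of one regression tree
leaf <- cut(x, breaks = c(0, 4, 8, 12), labels = c("leaf1", "leaf2", "leaf3"))

# Leaf predictions are the means of the observations in each leaf
leaf_means_clean <- tapply(y, leaf, mean)
leaf_means_out   <- tapply(y_out, leaf, mean)

# A global linear fit for comparison
slope_clean <- coef(lm(y ~ x))[["x"]]
slope_out   <- coef(lm(y_out ~ x))[["x"]]

print(rbind(clean = leaf_means_clean, with_outlier = leaf_means_out))
print(c(slope_clean = slope_clean, slope_out = slope_out))
```

In a real forest the splits are learned from the data, so an extreme value can still influence where a split is placed; the sketch only shows why its effect on the predictions tends to stay local.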
We will predict the CO2 emissions, given the rest of the indicators. Carbon dioxide emissions affect the planet significantly, as CO2 is the greenhouse gas emitted in the largest quantities. This contributes to global warming and, ultimately, to climate change. The warming causes extreme weather events such as tropical storms, wildfires, severe droughts and heat waves.
We will predict the carbon emissions of all filtered African countries and examine which indicators contribute to the carbon emissions of these countries.
library(randomForest)
library(vip)
set.seed(123)
# fit random forest model (all indicators and the country as predictors, Year excluded)
african_rf_model <- randomForest(co2_per_cap ~ . - Year, data = reshaped_african)
# predict: without newdata, predict() returns the out-of-bag predictions for the training data
african_rf_pred <- predict(african_rf_model)
# add the predicted values to the original data for easy plotting
african_co2 <- reshaped_african %>% mutate(co2_pred = african_rf_pred)
# plot the actual and the predicted values over time
co2_plot <- ggplot() + theme_bw() +
  geom_point(
    data = african_co2,
    mapping = aes(x = Year, y = co2_per_cap, color = "real")
  ) +
  geom_point(
    data = african_co2,
    mapping = aes(x = Year, y = co2_pred, color = "predicted")
  ) +
  ggtitle("Carbon Emission in Africa") +
  theme(plot.title = element_text(hjust = 0.5)) +
  ylab("CO2 emissions per capita")
# variable importance
rfImp <- vip(african_rf_model, color = 'red', fill = 'light blue') +
  ggtitle('Variable Importance') +
  theme(plot.title = element_text(hjust = 0.5))
grid.arrange(co2_plot, rfImp)
As we can see in the result, one country emits much more carbon dioxide than the rest of the filtered African countries. Overall, agricultural land and electric power consumption are the indicators most strongly associated with the CO2 emissions of these countries. But which country is this? We will apply the random forest model to each country to find the indicators that contribute to its CO2 emissions.
set.seed(28)
# define a function that fits a random forest model for one country
func_rf_model <- function(countryname) {
  # create a dataset for the given country and drop the country column
  african_rf_data <- filter(reshaped_african, country == countryname)
  african_rf_data <- african_rf_data[c(2:13)]
  # fit random forest model
  african_rf_model <- randomForest(co2_per_cap ~ . - Year, data = african_rf_data)
  # without newdata, predict() returns the out-of-bag predictions for the training data
  african_rf_pred <- predict(african_rf_model)
  # add the predicted values to the data for easy plotting
  african_rf_data <- african_rf_data %>% mutate(co2_pred = african_rf_pred)
  # calculate R-squared as 1 - (residual sum of squares / total sum of squares)
  residual_ss <- sum((african_rf_data$co2_per_cap - african_rf_data$co2_pred)^2)
  total_ss <- sum((african_rf_data$co2_per_cap - mean(african_rf_data$co2_per_cap))^2)
  african_rf_r2 <- 1 - residual_ss / total_ss
  african_rf_rmse <- RMSE(african_rf_pred, african_rf_data$co2_per_cap)
  # plot actual vs predicted
  plot_country <- ggplot() + theme_bw() +
    geom_point(
      data = african_rf_data,
      mapping = aes(x = Year, y = co2_per_cap, color = "real")
    ) +
    geom_point(
      data = african_rf_data,
      mapping = aes(x = Year, y = co2_pred, color = "predicted")
    ) +
    ggtitle(paste(countryname, "; R-squared: ", round(african_rf_r2 * 100, 0), "%", "; RMSE:", round(african_rf_rmse, 6))) +
    theme(plot.title = element_text(hjust = 0.5)) +
    ylab("CO2 emissions per capita (kt)")
  plot_varImp <- vip(african_rf_model, color = 'red', fill = 'light blue')
  grid.arrange(plot_country, plot_varImp, nrow = 2)
  # note: the element named mse holds the RMSE
  metrics <- list(r2 = african_rf_r2, mse = african_rf_rmse)
  return(metrics)
}
#Get the list
rf_r2 <- list()
afr <- levels(reshaped_african$country)
afr
## [1] "Botswana" "Ethiopia" "Ghana" "Morocco" "South Africa"
## [6] "Zimbabwe"
for (i in seq_along(afr)) {
  # Keep only the R2 value; the function also returns the RMSE
  rf_r2[[i]] <- func_rf_model(afr[i])$r2
}
With the help of the RF model, we can see the carbon emissions of each country and which economic indicators are most strongly associated with them. The metric summary table is shown in the next section. Although the random forest models achieve a low RMSE on this dataset, it is still obvious that South Africa emits much more CO2 than the other countries.
For South Africa, the RF model tracks the actual carbon dioxide emissions over the years. “Rural population (% of total population)”, “Population (Total)” and “Electric power consumption (kWh per capita)” are the top 3 indicators contributing to the predicted carbon dioxide emissions. Morocco and Botswana emit the least CO2, but even there the total and rural population as well as electric power consumption are the most important indicators.
As a comparison to the random forest model, a simple linear model is trained to predict the CO2 emissions from all defined predictors.
# Define a function that fits a linear model for one country
func_linear_model <- function(countryname) {
  # Subset the data to the given country and drop the country column
  country_data <- filter(reshaped_african, country == countryname)
  country_data <- country_data[c(2:13)]
  # Fit linear model
  african_linear_model <- lm(co2_per_cap ~ . - Year, data = country_data)
  # In-sample predictions (fitted values)
  african_lm_pred <- predict(african_linear_model, newdata = country_data)
  # Add the predicted values to the data for easy plotting
  country_data <- country_data %>% mutate(co2_pred = african_lm_pred)
  # Plot actual vs predicted
  plot_country <- ggplot() + theme_bw() +
    geom_point(
      data = country_data,
      mapping = aes(x = Year, y = co2_per_cap, color = "real")
    ) +
    geom_point(
      data = country_data,
      mapping = aes(x = Year, y = co2_pred, color = "predicted")
    ) +
    ggtitle(paste("CO2 Emission for", countryname)) +
    ylab("CO2 emissions per capita [-]")
  #plot(plot_country)
  # Collect the coefficients without the intercept for the bar plot
  cof <- african_linear_model$coefficients[-1]
  cof <- data.frame(term = names(cof), estimate = cof)
  plot_bars <- ggplot(data = cof, mapping = aes(x = term, y = estimate)) + theme_bw() +
    geom_bar(stat = "identity") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) +
    xlab("Coefficient") +
    ylab("Value")
  # plot(plot_bars)
  grid.arrange(plot_country, plot_bars, nrow = 1)
  return(summary(african_linear_model)$r.squared)
}
rsquared_lm <- list()
afr <- levels(reshaped_african$country)
for (i in seq_along(afr)) {
  rsquared_lm[[i]] <- func_linear_model(afr[i])
}
The linear models fit the data quite well, as can also be seen in the R2 table in the next chapter. Comparing the coefficients across countries, no predictor is consistently important, and the coefficients do not even agree in sign (positive or negative influence on the prediction). One example is the share of alternative and nuclear energy in total energy use. We would expect a high share to have a strong negative influence on CO2 emissions per capita. However, the coefficient is only slightly negative in some of the countries. In Botswana it is even strongly positive, which would imply that a high share of alternative and nuclear energy leads to higher CO2 emissions. In some models, agricultural land, the rural population ratio and the adjusted net national income are important predictors, but their coefficients are also sometimes positive and sometimes negative. A good example is the agricultural land ratio, whose coefficient is strongly positive for one of Botswana and South Africa and strongly negative for the other.
One reason for these discrepancies might be that many of the predictors are highly correlated with each other. Another is that many other influencing indicators are missing because we filtered them out at the beginning of the preprocessing.
From this result we conclude that the influence of a single predictor on CO2 emissions cannot be generalized across multiple countries.
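How strongly correlated predictors are can be checked with pairwise correlations and variance inflation factors (VIF). The sketch below is an illustration on made-up data, not an analysis of the project data: `x2` is constructed as a near copy of `x1`, and the VIF is computed by hand as 1 / (1 - R2) from regressing each predictor on the others.

```r
set.seed(42)
n <- 36
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)  # nearly a copy of x1
x3 <- rnorm(n)                  # unrelated to x1 and x2
toy <- data.frame(x1, x2, x3)
toy$y <- 2 * x1 + x3 + rnorm(n, sd = 0.3)

predictors <- c("x1", "x2", "x3")
# Pairwise correlations between the predictors
round(cor(toy[predictors]), 2)

# VIF_j = 1 / (1 - R2_j), where R2_j comes from regressing
# predictor j on all other predictors
vif <- sapply(predictors, function(v) {
  others <- setdiff(predictors, v)
  r2 <- summary(lm(reformulate(others, response = v), data = toy))$r.squared
  1 / (1 - r2)
})
round(vif, 1)

# Coefficients of the collinear pair are poorly determined
coef(lm(y ~ x1 + x2 + x3, data = toy))
```

A common rule of thumb treats VIF values above roughly 5 to 10 as a sign that the individual coefficients, including their signs, should not be interpreted on their own.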
rsquared <- data.frame(afr, as.numeric(rsquared_lm), as.numeric(rf_r2))
names(rsquared) <- c("Country", "R2 Linear Model", "R2 Random Forest")
knitr::kable(rsquared)
| Country | R2 Linear Model | R2 Random Forest |
|---|---|---|
| Botswana | 0.9680402 | 0.9518020 |
| Ethiopia | 0.6258348 | 0.6974389 |
| Ghana | 0.8795509 | 0.7791209 |
| Morocco | 0.9889679 | 0.9671508 |
| South Africa | 0.9396181 | 0.8140692 |
| Zimbabwe | 0.9103420 | 0.8155649 |
Comparing only the R2 values of the fitted random forest and linear models for the African countries, we can see that the data can be fitted quite well. The only exception is Ethiopia, which has a higher variability in its data. The linear models achieve a higher R2 for every country except Ethiopia, where the random forest model scores better. No separate test data was used to compare the models, so they might be overfitted. Note also that the comparison is not fully like for like: as the code is written, the random forest predictions are out-of-bag predictions, whereas the linear model R2 is computed on the training data itself.
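One way to check for overfitting would be to hold out the most recent years as test data. The following is a minimal sketch on made-up data and not something the project did: the cut-off year 2000 and the names `toy`, `train` and `test` are assumptions for illustration.

```r
set.seed(7)
# Made-up yearly data covering the same period as the project data
toy <- data.frame(Year = 1972:2007)
toy$x <- rnorm(nrow(toy))
toy$y <- 0.8 * toy$x + rnorm(nrow(toy), sd = 0.2)

# Time-based split: fit on the earlier years, evaluate on the later ones
train <- subset(toy, Year <= 2000)
test <- subset(toy, Year > 2000)

fit <- lm(y ~ x, data = train)
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
train_rmse <- rmse(train$y, predict(fit, newdata = train))
test_rmse <- rmse(test$y, predict(fit, newdata = test))
c(train = train_rmse, test = test_rmse)
```

A test error that is much larger than the training error would indicate overfitting. With only 36 yearly observations per country the test set is small, so such an estimate would be noisy.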
Let us return to the adjusted net national income per capita (current US$). The plot shows that national income follows a fairly steady trend in most of the six African countries, but some countries do not follow this pattern. We tease this apart by fitting a model with a linear trend for each country. The model captures the steady trend in national income, and the residuals show what is left.
net_income <- reshaped_african %>%
ggplot(aes(Year, adj_net_national_income, color=country)) +
geom_line(alpha = 1/3) +
ggtitle("Yearly Adjusted Net National Income per Capita ")
net_income
Let us fit a simple linear model for a single country, South Africa.
library(modelr)
library(gridExtra)
south_africa <- filter(reshaped_african, country=="South Africa")
#full South African data
full_data <- south_africa %>%
ggplot(aes(Year, adj_net_national_income)) +
geom_line() +
ggtitle("Full data")
# Apply a simple linear regression model to the South African data
south_africa_lm <- lm(adj_net_national_income ~ Year, data = south_africa)
linear <- south_africa %>%
add_predictions(south_africa_lm) %>%
ggplot(aes(Year, pred)) +
geom_line() +
ggtitle("Linear Trend")
# Residuals
remaining_pattern <- south_africa %>%
add_residuals(south_africa_lm) %>%
ggplot(aes(Year, resid)) +
geom_hline(yintercept = 0, color="white", size=3) +
geom_line() +
ggtitle("Remaining pattern")
grid.arrange(full_data,linear,remaining_pattern, ncol = 3, nrow = 1)
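The three panels follow from a simple identity: the observed series equals the fitted linear trend plus the residuals. The sketch below illustrates this in base R on a made-up series; the column names `income`, `trend` and `resid` are hypothetical.

```r
set.seed(3)
# Made-up series: a linear trend plus a slow oscillation and some noise
toy <- data.frame(Year = 1972:2007)
toy$income <- 0.04 * (toy$Year - 1972) +
  0.3 * sin((toy$Year - 1972) / 3) +
  rnorm(nrow(toy), sd = 0.05)

trend_fit <- lm(income ~ Year, data = toy)
toy$trend <- as.numeric(fitted(trend_fit))     # the "Linear Trend" panel
toy$resid <- as.numeric(residuals(trend_fit))  # the "Remaining pattern" panel

# The two components add back up to the original series
all.equal(toy$income, toy$trend + toy$resid)
# Residuals of a least-squares fit with intercept average to zero
mean(toy$resid)
```

Whatever structure remains in the residuals, here the oscillation, is what the linear trend could not explain.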
We now have a linear model, but only for South Africa. We will fit the same model to the other countries to see what their linear trends look like. For this we need a new data structure, a nested data frame, which has one row per country with that country's data stored in a list column. This lets us examine “Adjusted net national income per capita (current US$)” for each country.
by_country <- reshaped_african %>%
group_by(country) %>%
nest()
by_country$data[[1]]
## # A tibble: 36 x 12
## Year adj_net_nationa… agri_land alter_and_nucle… elect_power_con…
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 1972 -0.816 -0.547 -0.310 -0.385
## 2 1973 -0.732 -0.547 -0.310 -0.385
## 3 1974 -0.689 -0.547 -0.310 -0.385
## 4 1975 -0.652 -0.547 -0.310 -0.385
## 5 1976 -0.676 -0.547 -0.310 -0.385
## 6 1977 -0.616 -0.547 -0.310 -0.385
## 7 1978 -0.498 -0.547 -0.310 -0.385
## 8 1979 -0.324 -0.547 -0.310 -0.385
## 9 1980 -0.190 -0.547 -0.310 -0.385
## 10 1981 -0.144 -0.547 -0.918 -0.392
## # … with 26 more rows, and 7 more variables: gdp_per_cap <dbl>,
## # household_consump_expend <dbl>, total_pop <dbl>, rural_pop <dbl>,
## # adj_savings_per_cap <dbl>, co2_per_cap <dbl>, elect_prod_per_cap <dbl>
Now that we have a nested data frame, we are in a good position to fit the linear model to each country.
library(purrr)
library(dplyr)
# Model function: linear trend of national income over time for one country's data
country_model <- function(df) {
  lm(adj_net_national_income ~ Year, data = df)
}
#str(by_country$data)
# Apply the model to every data frame. The data frames are stored in a list, so we can use map() from the purrr package
models <- map(by_country$data, country_model)
# Store the fitted models as a new list column in by_country using mutate
by_country <- by_country %>%
  mutate(model = map(data, country_model))
by_country
## # A tibble: 6 x 3
## # Groups: country [6]
## country data model
## <fct> <list> <list>
## 1 Botswana <tibble [36 × 12]> <lm>
## 2 Ethiopia <tibble [36 × 12]> <lm>
## 3 Ghana <tibble [36 × 12]> <lm>
## 4 Morocco <tibble [36 × 12]> <lm>
## 5 South Africa <tibble [36 × 12]> <lm>
## 6 Zimbabwe <tibble [36 × 12]> <lm>
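The same "one model per country" idea can also be sketched without nested data frames, using `split()` and `lapply()` from base R. This is an alternative illustration on made-up data, not the approach used in this report; the country labels `A`, `B`, `C` and their slopes are invented.

```r
set.seed(11)
# Made-up panel: three countries with different linear trends
toy <- data.frame(
  country = rep(c("A", "B", "C"), each = 36),
  Year = rep(1972:2007, times = 3)
)
slope <- c(A = 0.05, B = 0.02, C = 0)
toy$income <- unname(slope[toy$country]) * (toy$Year - 1972) +
  rnorm(nrow(toy), sd = 0.1)

# One data frame per country, then one linear trend model per data frame
by_group <- split(toy, toy$country)
models <- lapply(by_group, function(df) lm(income ~ Year, data = df))

# Collect R2 per country and flag poorly fitting trends
r2 <- sapply(models, function(m) summary(m)$r.squared)
round(r2, 2)
names(r2)[r2 < 0.3]
```

The nested data frame has the advantage that data, models and derived summaries stay together in one table, which is why the report uses it.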
Next, we add the residuals to the dataset.
by_country <- by_country %>%
mutate(resids= map2(data, model, add_residuals))
resids <- unnest(by_country, resids)
resids %>%
ggplot(aes(Year, resid)) +
geom_line(aes(group = country, color=country), alpha = 1/3) +
geom_smooth(se=FALSE) +
ggtitle("Adjusted net national income per capita (current US$) residuals per Year per Country ")
The residuals are large for some countries, which suggests that the linear model does not fit them well. Instead of looking at the residuals, we can look at some general measures of model quality for each country.
by_country <- by_country %>%
  mutate(glance = map(model, broom::glance)) %>%
  unnest(glance)
by_country
## # A tibble: 6 x 15
## # Groups: country [6]
## country data model resids r.squared adj.r.squared sigma statistic p.value
## <fct> <lis> <lis> <list> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Botswa… <tib… <lm> <tibb… 0.917 0.915 0.340 377. 5.59e-20
## 2 Ethiop… <tib… <lm> <tibb… 0.675 0.665 0.122 70.6 8.23e-10
## 3 Ghana <tib… <lm> <tibb… 0.178 0.154 0.116 7.35 1.04e- 2
## 4 Morocco <tib… <lm> <tibb… 0.844 0.840 0.167 184. 2.80e-15
## 5 South … <tib… <lm> <tibb… 0.740 0.732 0.479 96.8 1.77e-11
## 6 Zimbab… <tib… <lm> <tibb… 0.358 0.340 0.120 19.0 1.15e- 4
## # … with 6 more variables: df <int>, logLik <dbl>, AIC <dbl>, BIC <dbl>,
## # deviance <dbl>, df.residual <int>
With this data in hand, we can start to look for models that do not fit well.
bad_fit <- filter(by_country, r.squared < 0.3)
reshaped_african %>%
semi_join(bad_fit , by = "country") %>%
ggplot(aes(Year, adj_net_national_income , color= country)) +
geom_line()
Based on the results, Ghana stands apart from the rest of the African countries: a linear trend explains little of its national income. It would be valuable if international organizations paid attention to Ghana in order to improve its economic situation with respect to income. It would also be good to understand why Ghana differs from the rest. To do this, we fit a multiple linear regression predicting “Adjusted net national income per capita (current US$)” from some of the available indicators.
ghana_data <- filter(reshaped_african, country=="Ghana")
# Fit a multiple linear model for adj_net_national_income
ghana_linear_model <- lm( adj_net_national_income ~ agri_land + household_consump_expend + total_pop + rural_pop + adj_savings_per_cap , data = ghana_data)
# plot coefficient
coefplot(ghana_linear_model) +
ggtitle("Ghana Coef") +
ylab("Indicators")
Of course, the population has a big influence on the net national income, as does agricultural land, but this does not really tell us which factors affect Ghana’s income growth. To figure that out we would need indicators such as national accounts, production, public finance and so on. We therefore end the analysis here.
For this project, we took on the competition “United Nations Millennium Development Goals” hosted by DrivenData. The original data contains 195402 observations of 40 variables, covering economic data from 1972 to 2007 and 1305 economic indicators for 214 countries. We preprocessed the data intensively, leaving only 11 indicators and 6 countries.
The plots gave us some understanding of the dataset. We predicted carbon dioxide emissions for each randomly selected African country using random forest and linear regression models. Both models explain roughly 63% to 99% of the variance (R2) in these countries. With the help of these models we identified some indicators that are associated with the CO2 emissions of the African countries. This could give us ideas for ensuring environmental sustainability.
A linear model was also used to examine the economic stability of each country in terms of “Adjusted Net National Income per capita” over time. This could inform the development of a global partnership that helps other countries attain a comfortable economic life.
There is much more in the dataset that could give us ideas for supporting global development, but for this project we end the analysis here.